Data Science in Environmental Science and Oceanography with R and Python
Course Notes
1 About
This guide was created for the course “Data Science in Environmental Science and Oceanography with R and Python” at the Informatica Feminale 2025.
2 Before the course
2.1 Installation Notes
If you haven’t installed R and RStudio yet, use the following instructions for the installation:
- Install R:
Installation on Windows and Mac: https://www.dataquest.io/blog/installing-r-on-your-computer/
For Linux: https://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed-_0028Unix_002dlike_0029
- Install the free RStudio Desktop version: https://www.rstudio.com/products/rstudio/#Desktop
Do you know how you can install new R packages? You can read about it here: https://thedatacommunity.org/2017/09/17/installing-loading-unloading-and-removing-r-packages-in-rstudio/
In this script, we make use of the following packages: reshape, ggplot2, lubridate, magrittr, car
You can install them using the following command:
install.packages(c("reshape", "ggplot2", "lubridate", "magrittr", "car"))
Many integrated development environments (IDEs) exist for Python. For this beginners’ course, we recommend the free and open-source Python IDE Thonny.
Install Python:
Thonny is available for Windows, Mac and Linux. Just download the suitable installer for your computer and run it. With that, the basic setup of Python is done. When you open Thonny for the first time, you will see the default window layout with a code editor on the upper left side, a shell on the lower left side and an assistant on the right hand side.
Install additional Python packages:
During our course, we will use several additional packages. All of them can be installed via the package installer pip.
Example: Write the following code in the shell to install PANGAEApy.
!pip install pangaeapy
Or click on Tools in the upper left hand corner of Thonny and select Manage packages…. Search for jupyter on PyPI and click on the jupyter search result to start the installation.
The following packages are needed for this course (note that re, os, datetime, difflib and collections are part of Python’s standard library and do not need to be installed):
- jupyter
- pangaeapy
- numpy
- pandas
- re
- os
- datetime
- ipdb
- requests
- LatLon23
- difflib
- collections
- matplotlib
- plotly
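If you are unsure which of these are already present, a quick check can be run in Thonny’s shell or a notebook cell. The snippet below is a sketch that probes a few of the installable packages using the standard library’s importlib:

```python
import importlib.util

# check which course packages are already importable;
# the list below is an illustrative subset of the packages above
packages = ["numpy", "pandas", "requests", "matplotlib"]
for name in packages:
    if importlib.util.find_spec(name) is None:
        print(name, "is missing - install it with pip")
    else:
        print(name, "is installed")
```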
2.2 Programming Environments
For R, we recommend RStudio.
2.2.0.1 Creating a new project
In many statistics classes you are taught to set a working directory:
setwd("~/Documents/")
It can be helpful to create a project instead. All files within the project’s directory will automatically have the correct working directory. This way, you can create new files and start coding right away.
- Click File -> New Project… Even when you choose “Existing Directory” you will still be able to create a new directory
- Enter the directory path straight away, or click on Browse.
- Choose an existing directory or create a new one
2.2.0.2 RStudio screen
- In the upper left hand corner you have the code editor. This is where you can type your scripts.
- In the bottom left hand corner you have the console. Here you will see the results of all commands you are running.
- In the upper right hand corner you can see all objects you created, e.g. data frames, functions, strings, …
- In the bottom right hand corner you will be able to see the files in your project, your plots and more information, e.g. about functions
You can use “Ctrl” followed by a number to jump to a specific window. If you want to get into the terminal use “Shift+Alt+t” or “Shift+Alt+m”. You can find an overview of all short cuts using “Shift+Alt+k”.
2.2.0.3 R Workflow
- You type your scripts in the code editor, so that you can save them.
- If you want to run the whole script, you can click on ‘Source’ at the top of the code editor or press “Ctrl + Alt + R”. (Shortcuts might differ depending on your system.)
- Use “Ctrl + Alt + B/E” to run from the beginning to the current line or from the current line to the end.
- If you want to run a single line, you can click on ‘Run’ or press “Control + Enter”.
- If you want to run a few lines, highlight the lines you want to run and use ‘Run’/ “Control + Enter”
- If you want to use Autocomplete, you can press “Control + Space”
- If you want to read about an existing method, you can type “?<Name of the method>” to get to the documentation
2.2.0.4 Save your code in a script or notebook
Python code is saved either in scripts with the .py ending or in so-called Jupyter Notebooks with the .ipynb ending. There is no difference in the Python code itself, only in the way the code is executed.
Files with .py ending need to be executed in a shell or terminal, somewhere you type the command to execute your script. If you want to know more about the differences between a shell and a terminal, have a look at GeeksforGeeks.
Files with .ipynb ending consist of individual cells which can be executed directly in the Jupyter Notebook.
Note: If you open a Jupyter Notebook in a plain text editor, you will see that the Python code is mixed with additional code that describes each notebook cell. You can convert .py files to .ipynb files and vice versa using a command line tool in the shell as described in the nbconvert documentation.
In this course we will work with Jupyter Notebook. To open a Jupyter environment write in the shell window of Thonny:
!jupyter notebook
After pressing enter, a browser window will open to show you your Home directory. Double click on your Jupyter Notebook of choice to open it or open a new one with the File button, select new and Notebook. Select the suggested Kernel and start scripting.
Type anything you want, e.g. print('Hello World'), into the cell and execute it by clicking on the button with the single triangle.
2.3 Data types
2.3.0.1 Character
firstname <- "Pipi"
surname <- "Langstrumpf"
name <- paste(firstname, surname)
name[1] "Pipi Langstrumpf"
2.3.0.2 Numeric
a <- 1
b <- 2
c <- a + b
c
[1] 3
d <- b*c/a
d
[1] 6
e <- d^2
e
[1] 36
f <- as.integer(3.14)
f
[1] 3
2.3.0.3 Logical
a <- TRUE
b <- FALSE
# or
a|b
[1] TRUE
# and
a&b
[1] FALSE
# not
!a
[1] FALSE
4 == 5
[1] FALSE
4 < 5
[1] TRUE
4 >= 4
[1] TRUE
is.na
function (x) .Primitive("is.na")
as.numeric(a)
[1] 1
2.3.0.4 && and ||
c(T,T,T) & c(T,F,F)
[1]  TRUE FALSE FALSE
c(T,T,T) && c(T,F,F)
Error in c(T, T, T) && c(T, F, F): 'length = 3' in coercion to 'logical(1)'
IDontExist
Error: object 'IDontExist' not found
T | IDontExist
Error: object 'IDontExist' not found
T || IDontExist
[1] TRUE
2.3.0.5 Vectors
# a simple numeric vector
myvector <- c(1,4,6)
myvector
[1] 1 4 6
# a vector of type character
myfamily <- c("Paula", "Ali", "Emma")
myfamily
[1] "Paula" "Ali"   "Emma"
# get the first element of a vector
myfamily[1]
[1] "Paula"
# apply function to all items:
myvector + 1
[1] 2 5 7
paste(myfamily, "Meier")
[1] "Paula Meier" "Ali Meier"   "Emma Meier"
# concatenate vectors:
longervector <- c(myvector, 5:7)
longervector
[1] 1 4 6 5 6 7
# create a sequence
odd <- seq(from = 1, to = 10, by = 2)
odd
[1] 1 3 5 7 9
# create a boring sequence
boring <- rep(1, 10)
boring
 [1] 1 1 1 1 1 1 1 1 1 1
2.3.0.6 Factors
fac <- factor(c("good", "better", "best"))
levels(fac)
[1] "best"   "better" "good"
nlevels(fac)
[1] 3
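As an aside for the Python part of this course: Python has no built-in factor type, but pandas provides a rough counterpart called Categorical. A minimal sketch with the same values as the factor above:

```python
import pandas as pd

# pandas' Categorical is roughly the Python counterpart of R's factor
fac = pd.Categorical(["good", "better", "best"])
print(list(fac.categories))  # like levels(fac): ['best', 'better', 'good']
print(len(fac.categories))   # like nlevels(fac): 3
```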
2.3.0.7 List of lists
You can create a list of lists (a list of vectors):
myList <- list(smallLetters=letters[1:10],
capitalLetters=LETTERS[1:10],
numbers=1:5)
myList
$smallLetters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$capitalLetters
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
$numbers
[1] 1 2 3 4 5
If you want to select sublists, you can do it with single brackets:
# Accessing multiple elements
myList[1:2]
$smallLetters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$capitalLetters
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
If you want to choose single elements, you need double brackets:
# Accessing single elements
myList[2][2]
$<NA>
NULL
myList[[2]][2]
[1] "B"
myList[[1:2]]
[1] "b"
2.3.0.8 Matrices
m <- matrix(data=1:12, nrow = 3, ncol = 4)
m
     [,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
Element-wise operations:
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Element-wise operations
m * 0.5
     [,1] [,2] [,3] [,4]
[1,] 0.5 2.0 3.5 5.0
[2,] 1.0 2.5 4.0 5.5
[3,] 1.5 3.0 4.5 6.0
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Element-wise operations
m * c(1,2,3)
     [,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 4 10 16 22
[3,] 9 18 27 36
Matrix-vector multiplication:
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Matrix multiplication
m %*% c(1,2,3,4)
     [,1]
[1,] 70
[2,] 80
[3,] 90
Good explanation of Matrix multiplication: Eli Bendersky’s website
2.3.0.9 Data frames
# Creating a data frame
family.frame <- data.frame(index = 1:3, firstname = myfamily, surname = rep("Meier", 3))
library(knitr)
kable(family.frame)
| index | firstname | surname |
|---|---|---|
| 1 | Paula | Meier |
| 2 | Ali | Meier |
| 3 | Emma | Meier |
A comprehensive list of data types is given in the Python documentation.
2.3.0.10 Character aka String
firstname = 'Pipi'
surname = 'Langstrumpf'
name = firstname+surname
print(name)
PipiLangstrumpf
name = firstname+' '+surname
print(name)
Pipi Langstrumpf
2.3.0.11 Numeric
import numpy as np
a = 1
b = 2
c = a + b
print(c)
3
d = b*c/a
print(d)
6.0
e = d ** 2
print(e)
36.0
f = int(3.14)
print(f)
3
g = float(3.14)
print(g)
3.14
h = np.nan
float(h)
nan
2.3.0.12 Logical
a = True
b = False
# or
a or b
True
# and
a and b
False
# not
not a
False
4 == 5
False
4 < 5
True
4 >= 4
True
2.3.0.13 Vectors aka list
# a simple numeric vector is similar to a list in python
myvector = [1,4,6]
print(myvector)
[1, 4, 6]
# a vector of type character
myfamily = ["Paula", "Ali", "Emma"]
print(myfamily)
['Paula', 'Ali', 'Emma']
# get the first element of a list (Python indices start at 0)
myfamily[0]
'Paula'
# append another item:
myfamily.append("Meier")
print(myfamily)
['Paula', 'Ali', 'Emma', 'Meier']
# create a sequence
odd = list(range(1, 10, 2)) # start, stop, step
print(odd)
[1, 3, 5, 7, 9]
# create a boring sequence
boring = list(range(10))
print(boring)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# concatenate lists:
longervector = myvector + odd
print(longervector)
[1, 4, 6, 1, 3, 5, 7, 9]
2.3.0.14 Matrices aka Array
import numpy as np
m = np.arange(1, 13).reshape(3, 4)
print(m)
[[ 1  2  3  4]
[ 5 6 7 8]
[ 9 10 11 12]]
Element-wise operations:
# Element-wise operations
result = m * 0.5
print(result)
[[0.5 1.  1.5 2. ]
[2.5 3. 3.5 4. ]
[4.5 5. 5.5 6. ]]
m = np.arange(1, 13).reshape(3, 4)
n = np.array([1, 2, 3, 4])
# Element-wise operations
result = m * n
print(result)
[[ 1  4  9 16]
[ 5 12 21 32]
[ 9 20 33 48]]
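Note the direction of this operation: R recycles c(1,2,3) down the columns, so each row gets its own factor, while NumPy broadcasts the length-4 vector across the rows, so each column gets its own factor. To reproduce the row-wise behaviour in NumPy, a sketch using a column vector:

```python
import numpy as np

m = np.arange(1, 13).reshape(3, 4)
# reshaping the vector to a column makes NumPy multiply row-wise,
# analogous to R's column-major recycling of m * c(1,2,3)
result = m * np.array([1, 2, 3]).reshape(3, 1)
print(result)
```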
Matrix-vector multiplication:
m = np.arange(1, 13).reshape(3, 4)
v = np.array([1, 2, 3, 4])
# Matrix multiplication
result = np.dot(m, v)
print(result)
[ 30  70 110]
# or:
result = np.matmul(m, v)
print(result)
[ 30  70 110]
Good explanation of Matrix multiplication: Eli Bendersky’s website
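Since Python 3.5, the @ operator provides a more readable shorthand for the same matrix multiplication:

```python
import numpy as np

m = np.arange(1, 13).reshape(3, 4)
v = np.array([1, 2, 3, 4])
# @ is the matrix-multiplication operator
print(m @ v)  # same result as np.dot(m, v) and np.matmul(m, v)
```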
2.3.0.15 Dataframes
Dataframes in Python are defined in the package Pandas.
import pandas as pd
# Creating a data frame from a list
myfamily = ['Paula', 'Ali', 'Emma']
lastname = ['Meier']*3
family_frame = pd.DataFrame(list(zip(myfamily,lastname)),columns=['firstname','surname'])
family_frame
  firstname surname
0 Paula Meier
1 Ali Meier
2 Emma Meier
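An often more direct alternative is to build the same frame from a dictionary of columns; the following sketch is equivalent to the zip() construction above:

```python
import pandas as pd

# the same frame built from a dictionary of columns
family_frame = pd.DataFrame({
    'firstname': ['Paula', 'Ali', 'Emma'],
    'surname': ['Meier'] * 3,
})
print(family_frame)
```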
3 PANGAEA - Introduction
A link to the slides will follow.
4 Basics R and Python
4.1 Preparing your data for analysis
4.1.0.1 Reading the data
Usually, you would use the “csv” format to read data into R. Sometimes you have other formats, e.g. “txt”. Ideally, you have exactly one line for the header. Make sure to choose the correct separator. You also have to tell R whether your file has a header or not.
df <- read.table("beedata.csv")Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 2 elements
That did not work. Oh, yes, the separator!
df <- read.table("beedata.csv", sep = ",")| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 |
|---|---|---|---|---|---|---|---|---|---|
| time | hive | h | t_i_1 | t_i_2 | t_i_3 | t_i_4 | t_i_5 | t_o | weight_kg |
| 1.565081418e+18 | 4 | NA | 27.25 | 26.375 | 23.4375 | 23.5 | 23.25 | 21.625 | NA |
| 1.565081428e+18 | 4 | NA | 27.25 | 26.4375 | 23.5 | 23.5 | 23.25 | 21.625 | NA |
| 1.565081431e+18 | 13 | 59.0554 | 22.875 | 22.1875 | 23 | 21.8125 | 22.6875 | 22.875 | 10.779 |
| 1.565081438e+18 | 4 | NA | 27.25 | 26.375 | 23.5 | 23.5625 | 23.25 | 21.625 | NA |
Looks ok, but we forgot the header:
df <- read.table("beedata.csv", sep = ",", header = T)| time | hive | h | t_i_1 | t_i_2 | t_i_3 | t_i_4 | t_i_5 | t_o | weight_kg |
|---|---|---|---|---|---|---|---|---|---|
| 1.565081e+18 | 4 | NA | 27.250 | 26.3750 | 23.4375 | 23.5000 | 23.2500 | 21.625 | NA |
| 1.565081e+18 | 4 | NA | 27.250 | 26.4375 | 23.5000 | 23.5000 | 23.2500 | 21.625 | NA |
| 1.565081e+18 | 13 | 59.0554 | 22.875 | 22.1875 | 23.0000 | 21.8125 | 22.6875 | 22.875 | 10.779 |
| 1.565081e+18 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA |
| 1.565081e+18 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA |
That looks good!
Alternatively, we can use “read.csv”. In this case, the default parameters work for us.
df <- read.csv("beedata.csv")Take care, sometimes you don’t get an error, but your data is screwed up anyway:
df <- read.csv("sampleData.csv", sep = " ")| name. | age. | subject. | year |
|---|---|---|---|
| Li; | 18; | chemistry; | 1 |
| Svenson,Jacob; | 25; | psychology; | 12 |
| Raphaela; | 23; | psychology; | 1 |
Use the head function to inspect your data.
4.1.0.2 Transforming entire columns
The columns/ variables of a dataframe basically behave like entries in a named list. Each element of this list is a vector of a specific type. You can inspect the structure of a dataframe using the function “str”.
df <- read.csv("beedata.csv")
str(df)
'data.frame': 171299 obs. of 10 variables:
$ time : num 1.57e+18 1.57e+18 1.57e+18 1.57e+18 1.57e+18 ...
$ hive : int 4 4 13 4 4 4 4 4 4 4 ...
$ h : num NA NA 59.1 NA NA ...
$ t_i_1 : num 27.2 27.2 22.9 27.2 27.2 ...
$ t_i_2 : num 26.4 26.4 22.2 26.4 26.4 ...
$ t_i_3 : num 23.4 23.5 23 23.5 23.5 ...
$ t_i_4 : num 23.5 23.5 21.8 23.6 23.6 ...
$ t_i_5 : num 23.2 23.2 22.7 23.2 23.2 ...
$ t_o : num 21.6 21.6 22.9 21.6 21.6 ...
$ weight_kg: num NA NA 10.8 NA NA ...
E.g. converting kg to g:
df$weight_g <- df$weight_kg*1000
Marking high temperature values:
df$highTemp <- df$t_i_1>25
| time | hive | h | t_i_1 | weight_kg | weight_g | highTemp |
|---|---|---|---|---|---|---|
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
| 1.565081e+18 | 13 | 59.0554 | 22.875 | 10.779 | 10779 | FALSE |
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
Dealing with the timestamp (converting nanoseconds to seconds):
df$time <- as.POSIXct(df$time/1000000000, origin="1970-01-01")
| time | hive | h | t_i_1 | t_i_2 | t_i_3 | t_i_4 | t_i_5 | t_o | weight_kg | weight_g | highTemp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA | 27.250 | 26.3750 | 23.4375 | 23.5000 | 23.2500 | 21.625 | NA | NA | TRUE |
| 2019-08-06 10:50:28 | 4 | NA | 27.250 | 26.4375 | 23.5000 | 23.5000 | 23.2500 | 21.625 | NA | NA | TRUE |
| 2019-08-06 10:50:31 | 13 | 59.0554 | 22.875 | 22.1875 | 23.0000 | 21.8125 | 22.6875 | 22.875 | 10.779 | 10779 | FALSE |
| 2019-08-06 10:50:38 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA | NA | TRUE |
| 2019-08-06 10:50:48 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA | NA | TRUE |
4.1.0.3 Reading data
One easy way to read ASCII data in Python is via Pandas. Many different formats can be read, e.g. “csv”, “txt” or “xlsx”. Ideally, you have exactly one line for the header. Data formats can be very diverse, and there are correspondingly many options to set before reading the data. Make sure to choose the correct separator.
# import package
import pandas as pd
# open the data as dataframe
df = pd.read_csv('beedata.csv')
# use head function to see the first 5 rows of the dataframe
df.head()
           time  hive        h   t_i_1  ...    t_i_4    t_i_5     t_o  weight_kg
0 1.565081e+18 4 NaN 27.250 ... 23.5000 23.2500 21.625 NaN
1 1.565081e+18 4 NaN 27.250 ... 23.5000 23.2500 21.625 NaN
2 1.565081e+18 13 59.0554 22.875 ... 21.8125 22.6875 22.875 10.779
3 1.565081e+18 4 NaN 27.250 ... 23.5625 23.2500 21.625 NaN
4 1.565081e+18 4 NaN 27.250 ... 23.5625 23.2500 21.625 NaN
[5 rows x 10 columns]
# use describe function to get an overview of the values of the dataframe
df.describe()
               time           hive  ...            t_o      weight_kg
count 1.712990e+05 171299.000000 ... 155142.000000 111106.000000
mean 1.565408e+18 8.382956 ... 18.709573 28.435138
std 1.668141e+14 4.523060 ... 31.252958 10.019183
min 1.565081e+18 3.000000 ... -0.062500 -59.093070
25% 1.565285e+18 4.000000 ... 14.625000 24.154370
50% 1.565363e+18 12.000000 ... 18.250000 24.832320
75% 1.565592e+18 12.000000 ... 20.937500 39.548892
max 1.565686e+18 14.000000 ... 2072.500000 69.451870
[8 rows x 10 columns]
4.1.0.4 Transforming entire columns
The columns/variables of a dataframe are called Series. Each item of a Series is of a specific type. You can inspect the structure of a dataframe using the function “info”.
# open the data as dataframe
df = pd.read_csv('beedata.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 171299 non-null float64
1 hive 171299 non-null int64
2 h 95156 non-null float64
3 t_i_1 155193 non-null float64
4 t_i_2 155203 non-null float64
5 t_i_3 155232 non-null float64
6 t_i_4 155148 non-null float64
7 t_i_5 133220 non-null float64
8 t_o 155142 non-null float64
9 weight_kg 111106 non-null float64
dtypes: float64(9), int64(1)
memory usage: 13.1 MB
# you can call each series via its header, e.g.
df['weight_kg']
0         NaN
1 NaN
2 10.77900
3 NaN
4 NaN
...
171294 24.31022
171295 NaN
171296 24.31170
171297 24.30817
171298 NaN
Name: weight_kg, Length: 171299, dtype: float64
E.g. converting kg to g:
df['weight_g'] = df['weight_kg']*1000.
df.head()
           time  hive        h   t_i_1  ...    t_i_5     t_o  weight_kg  weight_g
0 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
1 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
2 1.565081e+18 13 59.0554 22.875 ... 22.6875 22.875 10.779 10779.0
3 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
4 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
[5 rows x 11 columns]
Marking high temperature values:
df['highTemp'] = df['t_i_1']>25
Data can have various date and time formats. Here, we have timestamps in nanoseconds since 1970-01-01. Thus, dealing with the timestamp (nanoseconds to seconds):
df['time'] = pd.to_datetime(df['time'], unit='ns', origin=pd.Timestamp('1970-01-01'))
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 171299 non-null datetime64[ns]
1 hive 171299 non-null int64
2 h 95156 non-null float64
3 t_i_1 155193 non-null float64
4 t_i_2 155203 non-null float64
5 t_i_3 155232 non-null float64
6 t_i_4 155148 non-null float64
7 t_i_5 133220 non-null float64
8 t_o 155142 non-null float64
9 weight_kg 111106 non-null float64
10 weight_g 111106 non-null float64
11 highTemp 171299 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(9), int64(1)
memory usage: 14.5 MB
df.head()
                 time  hive        h  ...  weight_kg  weight_g  highTemp
0 2019-08-06 08:50:18 4 NaN ... NaN NaN True
1 2019-08-06 08:50:28 4 NaN ... NaN NaN True
2 2019-08-06 08:50:31 13 59.0554 ... 10.779 10779.0 False
3 2019-08-06 08:50:38 4 NaN ... NaN NaN True
4 2019-08-06 08:50:48 4 NaN ... NaN NaN True
[5 rows x 12 columns]
Use the head function to inspect your data.
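Once the time column has the datetime64 dtype, individual components can be pulled out with the .dt accessor. A small sketch with two of the nanosecond timestamps from above (a stand-in series, in case beedata.csv is not at hand):

```python
import pandas as pd

# a small stand-in series with nanosecond timestamps like those in beedata.csv
t = pd.to_datetime(pd.Series([1.565081418e18, 1.565081428e18]), unit='ns')
print(t.dt.year)    # the year of every timestamp
print(t.dt.month)   # the month of every timestamp
```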
4.2 Subsetting data
You can use square brackets and boolean expressions to choose rows and columns: df[<expression to choose rows>, <expression to choose columns>]
Select and plot data from hive 4:
df.4 <- df[df$hive==4,]
plot(df.4$time, df.4$t_o, ylim=c(0,40))
Select only the first 1000 lines:
df.4.sub <- df.4[1:1000,]
plot(df.4.sub$time, df.4.sub$t_o, ylim=c(0,40))
Delete columns / choose only some columns:
names(df)
 [1] "time"      "hive"      "h"         "t_i_1"     "t_i_2"     "t_i_3"
[7] "t_i_4" "t_i_5" "t_o" "weight_kg" "weight_g" "highTemp"
df.some <- df[, c(1,2,10)]
| time | hive | weight_kg |
|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA |
| 2019-08-06 10:50:28 | 4 | NA |
| 2019-08-06 10:50:31 | 13 | 10.779 |
| 2019-08-06 10:50:38 | 4 | NA |
| 2019-08-06 10:50:48 | 4 | NA |
You can also use:
df.same <- df[, c("time", "hive", "weight_kg")]| time | hive | weight_kg |
|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA |
| 2019-08-06 10:50:28 | 4 | NA |
| 2019-08-06 10:50:31 | 13 | 10.779 |
| 2019-08-06 10:50:38 | 4 | NA |
| 2019-08-06 10:50:48 | 4 | NA |
or
df.or <- df[, -c(3:9,11)]
| time | hive | weight_kg | highTemp |
|---|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA | TRUE |
| 2019-08-06 10:50:28 | 4 | NA | TRUE |
| 2019-08-06 10:50:31 | 13 | 10.779 | FALSE |
| 2019-08-06 10:50:38 | 4 | NA | TRUE |
| 2019-08-06 10:50:48 | 4 | NA | TRUE |
4.2.0.1 Useful functions
- summary
- head
- tail
- sum
- mean
You can use square brackets and boolean expressions to choose rows and columns.
Select and plot data from hive 4:
df = pd.read_csv('beedata.csv')
df_4 = df[df['hive']==4]
import matplotlib.pyplot as plt
# quick plot with default arguments
plt.plot(df_4['time'],df_4['t_o'])
For more plotting options see https://matplotlib.org/stable/gallery/index.html
fig, ax = plt.subplots()
ax.plot(df_4['time'],df_4['t_o'],'x')
ax.set(ylim=(0, 40))
[(0.0, 40.0)]
plt.show()
Select only the first 1000 lines:
df_4_sub = df_4[1:1000]
# combine head and tail to show the first and last 2 rows
pd.concat([df_4_sub.head(2),df_4_sub.tail(2)])
              time  hive   h    t_i_1  ...    t_i_4    t_i_5      t_o  weight_kg
1 1.565081e+18 4 NaN 27.2500 ... 23.5000 23.2500 21.6250 NaN
3 1.565081e+18 4 NaN 27.2500 ... 23.5625 23.2500 21.6250 NaN
1145 1.565091e+18 4 NaN 31.5625 ... 29.0625 28.8125 27.3750 NaN
1146 1.565091e+18 4 NaN 31.6250 ... 29.1250 28.8125 27.3125 NaN
[4 rows x 10 columns]
Delete columns: / Choose only some columns
df.columns
Index(['time', 'hive', 'h', 't_i_1', 't_i_2', 't_i_3', 't_i_4', 't_i_5', 't_o',
'weight_kg'],
dtype='object')
df_some = df[["time", "hive", "weight_kg"]]
pd.concat([df_some.head(2),df_some.tail(2)])
                time  hive  weight_kg
0 1.565081e+18 4 NaN
1 1.565081e+18 4 NaN
171297 1.565686e+18 12 24.30817
171298 1.565686e+18 4 NaN
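Rows and columns can also be selected in a single step with .loc, which takes a boolean mask for the rows and a list of names for the columns. A sketch on a toy frame with made-up values:

```python
import pandas as pd

# toy frame standing in for the bee data (illustrative values)
df = pd.DataFrame({'hive': [4, 13, 4], 't_o': [21.6, 22.9, 21.6],
                   'weight_kg': [None, 10.779, None]})
# .loc selects rows by a boolean mask and columns by name in one step
subset = df.loc[df['hive'] == 4, ['t_o', 'weight_kg']]
print(subset)
```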
4.2.0.2 Useful functions
- info
- head
- tail
- describe
- columns
4.3 Conditions - If and Else
4.3.0.1 If
Simple if:
a <- 19
if(a >= 18){
  print("Yes, you are allowed to see this content.")
}
[1] "Yes, you are allowed to see this content."
if and else:
goodWeather <- TRUE
if (goodWeather){
  print("Let's go to the beach!")
} else{
  print("Let's eat pizza.")
}
[1] "Let's go to the beach!"
Else-if:
do.you.want.this <- "chocolate"
#do.you.want.this <- "cookies"
#do.you.want.this <- "carrots"
if (do.you.want.this == "chocolate"){
print("Yes, I love chocolate!")
} else if (do.you.want.this == "cookies"){
print("Yes, if they are with chocolate.")
} else {
print("Hm, not sure. Do you have chocolate?")
}[1] "Yes, I love chocolate!"
library(lubridate)
birthday <- as.POSIXct("2019-08-06 10:50:18")
what.do.you.want.to.know <- "year"
#what.do.you.want.to.know <- "month"
#what.do.you.want.to.know <- "chocolate"
if (what.do.you.want.to.know == "year"){
print(year(birthday))
} else if (what.do.you.want.to.know == "month"){
print(month(birthday))
} else {
print("Sorry, what do you want to know?")
}[1] 2019
4.3.0.2 If
Note: if, else, etc. do not use (), {} or [] to mark their beginning and end. Python uses indentation.
Simple if:
a = 19
if a >= 18:
    print("Yes, you are allowed to see this content.")
Yes, you are allowed to see this content.
if and else:
goodWeather = True
if goodWeather:
    print("Let's go to the beach!")
else:
    print("Let's eat pizza.")
Let's go to the beach!
elif (short for else if):
do_you_want_this = "chocolate"
#do_you_want_this = "cookies"
#do_you_want_this = "carrots"
if do_you_want_this == "chocolate":
print("Yes, I love chocolate!")
elif do_you_want_this == "cookies":
print("Yes, if they are with chocolate.")
else:
print("Hm, not sure. Do you have chocolate?")Yes, I love chocolate!
from datetime import datetime
birthday = datetime.fromisoformat('2019-08-06 10:50:18')
what_do_you_want_to_know = "year"
#what_do_you_want_to_know = "month"
#what_do_you_want_to_know = "chocolate"
if what_do_you_want_to_know == "year":
print(birthday.year)
elif what_do_you_want_to_know == "month":
print(birthday.month)
else:
print("Sorry, what do you want to know?")2019
4.4 For and While
4.4.0.1 For
ages <- c(31, 29, 5)
total.age <- 0
for(i in ages){
  total.age <- total.age + i
}
total.age
[1] 65
myfamily <- list(Paula=31, Ali=29, Emma=5)
age <- 0
for (i in 1:length(myfamily)){
  age <- age + myfamily[[i]]
}
age
[1] 65
apply it to columns:
for (i in 1:ncol(df)){
  name <- names(df)[i]
  colclass <- class(df[,i])
  print(paste(name, ":", colclass))
}
[1] "time : POSIXct" "time : POSIXt"
[1] "hive : integer"
[1] "h : numeric"
[1] "t_i_1 : numeric"
[1] "t_i_2 : numeric"
[1] "t_i_3 : numeric"
[1] "t_i_4 : numeric"
[1] "t_i_5 : numeric"
[1] "t_o : numeric"
[1] "weight_kg : numeric"
[1] "weight_g : numeric"
[1] "highTemp : logical"
transform long code …
df <- read.csv("beedata.csv")
print("mean of h:")
mean(df$h)
print("mean of t_i_1:")
mean(df$t_i_1)
print("mean of weight_kg:")
mean(df$weight_kg)
print("missing h values:")
sum(is.na(df$h))
print("missing t_i_1 values:")
sum(is.na(df$h))
print("missing weight_kg values:")
sum(is.na(df$h))… into shorter code
to.analyse <- c("h", "t_i_1", "weight_kg")
for(i in 1:length(to.analyse)){
  print(paste("mean of", to.analyse[i], ":"))
  print(mean(df[,to.analyse[i]], na.rm = TRUE))
  print(paste("missing", to.analyse[i], "values:"))
  print(sum(is.na(df[,to.analyse[i]])))
}
4.4.0.2 While
a <- 2
while (a<1000){
  a <- a^2
  print(a)
}
[1] 4
[1] 16
[1] 256
[1] 65536
4.4.0.3 Do-while
a <- 10
repeat{
  a <- a - sqrt(a)
  print(a)
  if(sqrt(a) > a){
    break
  }
}
[1] 6.837722
[1] 4.222818
[1] 2.167869
[1] 0.6955003
bad example for Do-while:
a <- 10
repeat{
  a <- a - sqrt(a)
  print(a)
  if(a < 0){
    break
  }
}
[1] 6.837722
[1] 4.222818
[1] 2.167869
[1] 0.6955003
[1] -0.1384663
because it breaks for edge cases:
a <- -10
repeat{
  a <- a - sqrt(a)
  print(a)
  if(a < 0){
    break
  }
}
Warning in sqrt(a): NaNs produced
[1] NaN
Error in if (a < 0) {: missing value where TRUE/FALSE needed
4.4.0.4 For vs. While
- if you do not know the number of iterations needed, use while
- else, always use for!
- it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.
4.4.0.5 For
Note: for, while, etc. do not use (), {} or [] to mark their beginning and end. Python uses indentation.
ages = [31, 29, 5]
total_age = 0
for i in ages:
    total_age = total_age + i
print(total_age)
65
same as:
ages = [31, 29, 5]
total_age = 0
for age in ages:
    total_age = total_age + age
print(total_age)
65
The next example uses a dictionary with {key:value} structure.
myfamily = {'Paula':31, 'Ali':29, 'Emma':5}
age = 0
for key, value in myfamily.items():
    age = age + value
print(age)
65
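The dictionary loop above can be collapsed into a single call to the built-in sum():

```python
# sum() over the dictionary's values replaces the explicit loop
myfamily = {'Paula': 31, 'Ali': 29, 'Emma': 5}
print(sum(myfamily.values()))  # -> 65
```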
apply it to dataframes:
import pandas as pd
df = pd.read_csv('beedata.csv')
for ind,value in df['t_i_1'].items():
    if ind > 6 and ind <= 10:
        print('ind = ',ind)
        print('value = ',value)
ind =  7
value = 27.25
ind = 8
value = 27.25
ind = 9
value = 27.25
ind = 10
value = 22.6875
transform long code …
import pandas as pd
df = pd.read_csv('beedata.csv')
print("mean of h:")mean of h:
df['h'].mean()np.float64(56.397063812371265)
print("mean of t_i_1:")mean of t_i_1:
df['t_i_1'].mean()np.float64(25.847728319382963)
print("mean of weight_kg:")mean of weight_kg:
df['weight_kg'].mean()np.float64(28.43513778723021)
print("missing h values:")missing h values:
df['h'].isna().sum()np.int64(76143)
print("missing t_i_1 values:")missing t_i_1 values:
df['t_i_1'].isna().sum()np.int64(16106)
print("missing weight_kg values:")missing weight_kg values:
df['weight_kg'].isna().sum()np.int64(60193)
… into shorter code
to_analyse = ["h", "t_i_1", "weight_kg"]
for i in to_analyse:
    print("mean of", i, df[i].mean())
    print("sum of missing values of", i, df[i].isna().sum())
mean of h 56.397063812371265
sum of missing values of h 76143
mean of t_i_1 25.847728319382963
sum of missing values of t_i_1 16106
mean of weight_kg 28.43513778723021
sum of missing values of weight_kg 60193
4.4.0.6 While
a = 2
while (a<1000):
    a = a**2
    print(a)
4
16
256
65536
4.4.0.7 Do-while
import math
a = 10
while True:
    a = a - math.sqrt(a)
    print(a)
    if math.sqrt(a) > a:
        break
6.83772233983162
4.222818452528758
2.167868708002445
0.6955003070910275
bad example for Do-while:
import math
a = 10
while True:
    a = a - math.sqrt(a)
    print(a)
    if a < 0:
        break
6.83772233983162
4.222818452528758
2.167868708002445
0.6955003070910275
-0.13846630320642772
because it breaks for edge cases:
import math
a = -10
while True:
    try:
        a = a - math.sqrt(a)
    except ValueError:
        print("math domain error: sqrt(a) with negative a is not defined")
        break
    print(a)
    if a < 0:
        break
math domain error: sqrt(a) with negative a is not defined
4.4.0.8 For vs. While
- if you do not know the number of iterations needed, use while
- else, always use for!
- it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.
4.5 Writing own functions
The basic syntax to write functions looks like this:
myPlus <- function(a, b){
  return(a + b)
}
myPlus(1,2)
[1] 3
You can use default parameters:
myPlus <- function(a=1, b=1){
  return(a + b)
}
myPlus(1,2)
[1] 3
myPlus()
[1] 2
If you want to return multiple values, you can use a named list:
dataframe.info <- function(df){
  cells.count <- ncol(df)*nrow(df)
  return(list(columns=ncol(df), rows=nrow(df), cells= cells.count))
}
dataframe.info(df)
$columns
[1] 12
$rows
[1] 171299
$cells
[1] 2055588
The basic syntax to write functions looks like this:
def myPlus(a, b):
    c = a+b
    return c
myPlus(1,2)
3
You can use default parameters:
def myPlus(a=1, b=1):
    c = a+b
    return c
myPlus(1,2)
3
myPlus()
2
Example to return multiple values:
def dataframe_info(df):
    rows_count = df.shape[0]
    columns_count = df.shape[1]
    values_count = df.shape[0]*df.shape[1]
    return columns_count, rows_count, values_count
import pandas as pd
df = pd.read_csv('beedata.csv')
dataframe_info(df)
(10, 171299, 1712990)
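Because the function returns a tuple, the result can be unpacked directly into separate variables. A sketch on a small toy frame (so it runs without beedata.csv):

```python
import pandas as pd

def dataframe_info(df):
    rows_count = df.shape[0]
    columns_count = df.shape[1]
    return columns_count, rows_count, rows_count * columns_count

# tuple unpacking assigns each returned value to its own variable
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
columns, rows, values = dataframe_info(df)
print(columns, rows, values)  # -> 2 3 6
```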
4.6 Debugging
Here is a buggy version of the function “funny.words”.
funny.words <- function(s = "summer", count = 3){
s = "summer"
s1 <- ""
for(i in 1:2){
for(i in 1:count){
print(substr(s, i, i))
s1 <- paste(s1, substr(s, i, i), sep = "")
}
}
return(s1)
}
funny.words()[1] "s"
[1] "u"
[1] "m"
[1] "s"
[1] "u"
[1] "m"
[1] "sumsum"
funny.words("banana", 5)[1] "s"
[1] "u"
[1] "m"
[1] "m"
[1] "e"
[1] "s"
[1] "u"
[1] "m"
[1] "m"
[1] "e"
[1] "summesumme"
Since this code consists of a function and a loop, we cannot run line by line to find the mistakes. However, we can create breakpoints by clicking next to the line numbering (left).
Then we can click on “Source” to execute all code until the first breakpoint.
In the console, there will be a new set of buttons:
- If you click on Next, R will execute the next line. (e.g. line 6)
- If you click on the second button, R will step in the current function call, so it will basically jump into an other function. (e.g. into the print function)
- If you click on the third button, R will execute the rest of the current function or loop. (e.g. line 6 and 7)
- If you click on “Continue”, R will run until we come across the next breakpoint. (e.g. in the next round of the loop or in line 10)
- If you click on “Stop”, R will exit the debug mode.
Here is a buggy version of the function “funny_words”. We use the package ipdb for debugging.
Insert ipdb.set_trace() into the function to set a breakpoint inside the loop. You can then step through the code and print out variables.
# import ipdb
def funny_words(s = "summer", count = 3):
s = "summer"
s1 = ""
for i in range(2):
for i in range(count):
# ipdb.set_trace() # Pause the execution to inspect 's', 'i' and 'i'
print(s, i, i)
s1 =[s, i, i]
return s1try out:
funny_words()
summer 0 0
summer 1 1
summer 2 2
summer 0 0
summer 1 1
summer 2 2
['summer', 2, 2]
and
funny_words("banana", 5)
summer 0 0
summer 1 1
summer 2 2
summer 3 3
summer 4 4
summer 0 0
summer 1 1
summer 2 2
summer 3 3
summer 4 4
['summer', 4, 4]
You can use the following debugging commands, e.g. to print variables or to continue the loop until the next breakpoint:
- p <variable>: print the value of a variable
- c: continue to the next breakpoint
- n: move to the next line of code
- q: quit the debugging session
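Once the bugs are found, a corrected version can be sketched as follows (a sketch of the intended behaviour: keep the argument that was passed in, use a distinct variable per loop, and accumulate the characters in s1 instead of overwriting it):

```python
def funny_words(s="summer", count=3):
    # Do not overwrite the argument s (the bug hard-coded it to "summer").
    s1 = ""
    for _ in range(2):            # repeat the fragment twice
        for i in range(count):    # a distinct loop variable avoids shadowing
            print(s[i])
            s1 += s[i]            # accumulate instead of replacing s1
    return s1

funny_words("banana", 5)  # returns "bananbanan"
```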
4.7 Piping
Some people complain about all the nested brackets in R syntax. Using pipes, you can get rid of them. Pipes may also reflect your workflow more naturally than nested function calls.
A pipe operator is available in multiple packages (and since R 4.1 there is also the native pipe |>), but you get the most options with the magrittr package.
Here is a simple example
library(magrittr)
x <- 9
# Calculate the square-root of x
sqrt(x)
[1] 3
# Calculate it using pipes
x %>% sqrt
[1] 3
In case you want to update x, you can use:
x <- 9
# Calculate the square-root of x and update x
x <- sqrt(x)
x
[1] 3
# Calculate it using pipes
x <- 9
x %<>% sqrt
x
[1] 3
This is an example with two functions and additional arguments:
nrow(subset(df, hive==4))
[1] 60193
df %>% subset(hive==4) %>% nrow
[1] 60193
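Python has no built-in pipe operator, but the idea can be sketched in plain Python with a small helper class (the Pipe class here is hypothetical, not a standard library feature); in practice, pandas method chaining, e.g. df.query('hive == 4').shape[0], plays a similar role:

```python
import math

class Pipe:
    """Minimal pipe wrapper: Pipe(value) | f applies f from left to right."""
    def __init__(self, value):
        self.value = value

    def __or__(self, func):
        # Each step wraps the result again, so steps can be chained.
        return Pipe(func(self.value))

# R's  x %>% sqrt  corresponds to:
result = (Pipe(9) | math.sqrt).value
print(result)  # 3.0
```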
5 PANGAEA - downloading data
link to slides will follow
6 Data Cleaning
link to scripts will follow
7 Statistics
7.1 Definition - Simple Linear Regression
\[ \begin{equation} y_i = \beta_0 + \beta_1x_i + \epsilon_i, \qquad i=1,...,n \end{equation} \] where \(y\) is the dependent variable, \(x\) is the independent variable (also called explanatory variable), \(\beta_0\) is the intercept parameter, \(\beta_1\) is the slope parameter and \(\epsilon_i\sim N(0,\sigma^2)\) is the error term.
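The least-squares estimates have a closed form: \(\hat\beta_1 = \sum_i (x_i-\bar{x})(y_i-\bar{y}) / \sum_i (x_i-\bar{x})^2\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\). A minimal sketch in plain Python (no libraries assumed; the function name is our own):

```python
def fit_simple_lm(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # slope = covariance of x and y divided by variance of x
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Points on the exact line y = 1 + 2x are recovered without error:
print(fit_simple_lm([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```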
7.2 Sample data
x <- seq(0,5,0.1)
y <- x + rnorm(length(x))
plot(x, y)
7.3 Fitting a Linear Model
mod <- lm(y~x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.7296 -0.6980 -0.1558 0.7360 2.3166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.001092 0.277199 0.004 0.997
x 0.980949 0.095548 10.267 8.35e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.004 on 49 degrees of freedom
Multiple R-squared: 0.6826, Adjusted R-squared: 0.6762
F-statistic: 105.4 on 1 and 49 DF, p-value: 8.345e-14
plot(x,y)
abline(reg = mod)
7.4 Linear Model - Assumptions
- Linear relationship between x and y
- Independence of residuals
- Constant variance of residuals
- Normality: Residuals are normally distributed
7.5 Assumption 1 - Linear Relationship
- use a scatterplot to check
- if this assumption is violated we have unrealistic predictions and estimates of the standard deviation
- How to deal with this? You can transform the variables or alter the model.
plot(x,y)
7.6 Assumption 2 - Independence of Residuals
- Plot residuals in time/ observation order / based on observation location / …
- You expect that they are equally scattered around 0
- You can also use the Durbin-Watson test
- Autocorrelation can have serious effects on the model
- You can add an AR correlation term to the model. Sometimes it also helps to add a further covariate.
res <- residuals(mod)
plot(x, res)
7.7 Durbin-Watson-Test
H0: There is no correlation among the residuals.
HA: The residuals are autocorrelated.
library(car)
Loading required package: carData
durbinWatsonTest(mod)
 lag Autocorrelation D-W Statistic p-value
1 -0.06355024 2.096881 0.89
Alternative hypothesis: rho != 0
- The p-value is larger than 0.05, so we cannot reject the null hypothesis that there is no correlation between residuals.
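The Durbin-Watson statistic itself is \(D = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2\): values near 2 indicate no first-order autocorrelation, values towards 0 positive and towards 4 negative autocorrelation. A plain-Python sketch of the statistic only (car's durbinWatsonTest additionally estimates a p-value by simulation):

```python
def durbin_watson(res):
    """Durbin-Watson statistic of a residual sequence (~2 = no autocorrelation)."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    den = sum(e ** 2 for e in res)
    return num / den

# Perfectly alternating residuals are negatively autocorrelated (D close to 4):
print(durbin_watson([1, -1, 1, -1, 1, -1]))  # about 3.33
```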
7.8 Assumption 3 - Constant Variance of Errors
- Plot fitted values vs residuals
- You expect that they are equally scattered around 0
plot(fitted(mod),res)
7.9 Assumption 4 - Normality of Residuals
- Use qq-plot to compare quantiles.
- Or use the Shapiro-Wilk-Test
- If the residuals are not normally distributed, check for outliers or transform your variable.
qqPlot(res)
[1] 24 45
7.10 Shapiro-Wilk-Test for Normality
- H0: The data is normally distributed
- HA: The data comes from another distribution
shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.98899, p-value = 0.915
- We cannot reject the null-hypothesis that the data is normally distributed.
7.11 Other possibilities of checking assumptions
- Did you learn about this before? Did you hear about other options?
7.12 Definition - Linear Model with multiple predictors
\[\begin{equation}\label{eqn:linearregression} y_i = \beta_0 + \beta_1x_{1,i} + ... + \beta_px_{p,i} + \epsilon_i, \qquad i=1,...,n \end{equation}\] where \(y\) is the dependent variable, \(x_1 ... x_p\) are the independent variables (also called explanatory variables), \(\beta_0 ... \beta_p\) are the regression coefficients, \(\epsilon_i\sim N(0,\sigma^2)\) is the error term and \(p \geq 1\).
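In matrix notation the same model reads:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \]

where \(\mathbf{X}\) is the \(n \times (p+1)\) design matrix whose first column is all ones (for the intercept). The second expression is the least-squares estimator that lm() computes (numerically via a QR decomposition rather than an explicit matrix inverse).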
7.13 The penguin data set
Artwork by @allison_horst
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
https://allisonhorst.github.io/palmerpenguins/
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
https://github.com/mcnakhaee/palmerpenguins?tab=readme-ov-file
from palmerpenguins import load_penguins
penguins = load_penguins()
penguins.head()
  species     island  bill_length_mm  ...  body_mass_g     sex  year
0 Adelie Torgersen 39.1 ... 3750.0 male 2007
1 Adelie Torgersen 39.5 ... 3800.0 female 2007
2 Adelie Torgersen 40.3 ... 3250.0 female 2007
3 Adelie Torgersen NaN ... NaN NaN 2007
4 Adelie Torgersen 36.7 ... 3450.0 female 2007
[5 rows x 8 columns]
7.14 Exercise
Using the penguin data set, fit a linear model to determine the relationship between body mass and flipper length.
The flipper length differs across penguin species. How could you add this to your model?
7.14.1 Solution
Remove missing values:
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:car':
recode
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
penguins_filtered <- penguins %>%
filter(complete.cases(.))
Fit a linear model:
y <- penguins_filtered$body_mass_g
x <- penguins_filtered$flipper_length_mm
mod <- lm(y~x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1057.33 -259.79 -12.24 242.97 1293.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5872.09 310.29 -18.93 <2e-16 ***
x 50.15 1.54 32.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared: 0.7621, Adjusted R-squared: 0.7614
F-statistic: 1060 on 1 and 331 DF, p-value: < 2.2e-16
Check linear relationship:
plot(x,y)
Check independence of residuals:
res <- residuals(mod)
plot(x, res)
library(car)
durbinWatsonTest(mod)
 lag Autocorrelation D-W Statistic p-value
1 -0.06161287 2.115873 0.304
Alternative hypothesis: rho != 0
Check constant variance of errors:
plot(fitted(mod),res)
Check normality of residuals:
qqPlot(res)
[1] 35 164
hist(res)
shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.99237, p-value = 0.08633
Fit a new model to consider penguin species as well.
mod_add <- lm(penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm + penguins_filtered$species)
mod_spec <- lm(penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm * penguins_filtered$species)
summary(mod_spec)
Call:
lm(formula = penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm *
penguins_filtered$species)
Residuals:
Min 1Q Median 3Q Max
-900.90 -247.01 -25.53 197.19 1143.33
Coefficients:
Estimate
(Intercept) -2508.088
penguins_filtered$flipper_length_mm 32.689
penguins_filtered$speciesChinstrap -529.108
penguins_filtered$speciesGentoo -4166.117
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 1.884
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 21.477
Std. Error
(Intercept) 892.413
penguins_filtered$flipper_length_mm 4.692
penguins_filtered$speciesChinstrap 1525.110
penguins_filtered$speciesGentoo 1431.581
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 7.864
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 6.967
t value
(Intercept) -2.810
penguins_filtered$flipper_length_mm 6.967
penguins_filtered$speciesChinstrap -0.347
penguins_filtered$speciesGentoo -2.910
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 0.240
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 3.083
Pr(>|t|)
(Intercept) 0.00525
penguins_filtered$flipper_length_mm 1.78e-11
penguins_filtered$speciesChinstrap 0.72887
penguins_filtered$speciesGentoo 0.00386
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 0.81077
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 0.00223
(Intercept) **
penguins_filtered$flipper_length_mm ***
penguins_filtered$speciesChinstrap
penguins_filtered$speciesGentoo **
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 368.4 on 327 degrees of freedom
Multiple R-squared: 0.7938, Adjusted R-squared: 0.7906
F-statistic: 251.7 on 5 and 327 DF, p-value: < 2.2e-16
anova(mod_add, mod_spec)
Analysis of Variance Table
Model 1: penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm +
penguins_filtered$species
Model 2: penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm *
penguins_filtered$species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 329 45843144
2 327 44391669 2 1451475 5.346 0.005193 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding the interaction term between flipper length and species significantly improves the model (p ≈ 0.005).
8 Plotting
8.1 Standard library
# subset data
df.4 <- df[df$hive==4,]
# plot temperature outside
plot(df.4$time, df.4$t_o, ylim=c(0,40),type = 'p', pch=4)
# choose colours
cl <- rainbow(5)
# choose columns
cols <- 4:8
# plot each column
for (i in 1:5){
lines(df.4$time, df.4[,cols[i]],col = cl[i],type = 'p', pch=4, ylim=c(0,40))
}
# add legend
legend("topright", legend=c(1, 2, 3, 4, 5, "outside"),
col=c(cl, "black"), pch = 4, lty = 0, cex=0.8)
8.1.1 Points or Lines? (type)
- “p”: Points
- “l”: Lines
- “b”: Both
8.1.2 Line type (lty)
Source: http://www.sthda.com/english/wiki/line-types-in-r-lty
8.1.3 Point types (pch)
Source: http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r
8.2 ggplot
# subset data
df.4 <- df[df$hive==4,]
# choose columns
df.4.cols <- df.4[,c(1,4:9)]
# reshape data
library(reshape)
Attaching package: 'reshape'
The following object is masked from 'package:dplyr':
rename
The following object is masked from 'package:lubridate':
stamp
mdf <- melt(df.4.cols, id=c("time"))
# plot data
library(ggplot2)
ggplot(data = mdf, aes(x=time, y=value)) + geom_line(aes(colour=variable)) + ylim(c(0, 40))
9 R-Markdown
R-Markdown is a great way to document your code and to write reports about your results.
9.1 Output formats
The three most common output options for R-Markdown are html, pdf and word documents. If you want to create pdf documents, tex needs to be installed.
Depending on your choice, your document will have a different output parameter in the header:
# html
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: html_document
---
# word
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: word_document
---
There are many formatting options which you can specify in the header. These options might be useful for html:
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output:
html_document:
toc: true # print a table of content
toc_depth: 2 # maximal depth of the table of content
number_sections: true # automatically number the sections
code_folding: hide # have buttons to show/ hide code
theme: united # specify a theme
toc_float: true # generate a navigation bar
---
9.2 Document structure
If you create a new R-Markdown document, R will automatically create a sample document. You can see that “#” can be used to create headers:
# Header 1
## Header 2
### Header 3
9.3 Code blocks
For code blocks you have many options, too. The two most important are:
- eval: If set to true, the code will be executed
- echo: If set to true, the code itself will be shown in the document
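For example, a chunk that is executed but whose code is hidden in the output could look like this (a sketch; the options go in the chunk header):

````markdown
```{r, echo=FALSE, eval=TRUE}
plot(x, y)
```
````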
9.4 Lists
* unordered list
+ sub-item 1
+ sub-item 2
- sub-sub-item 1
* item 2
1. ordered list
2. item 2
i) sub-item 1
- unordered list
- sub-item 1
- sub-item 2
- sub-sub-item 1
- item 2
- ordered list
- item 2
- sub-item 1
9.5 Insert an image
9.6 Further documentation
A good documentation of R-Markdown can be found here: https://bookdown.org/yihui/rmarkdown/
And here is a cheat sheet: https://www.rstudio.com/resources/cheatsheets/#rmarkdown
There is a lot more you can do with R-Markdown, see: https://rmarkdown.rstudio.com/gallery.html